Note: This rmarkdown file has been uploaded to https://github.com/yytsong/challenge-hamlet. If you have any questions about this report, please email yuri.songyt@gmail.com.
Step 1: Load the text file into the session and inspect it - Use read_delim to load the txt file, skipping the character-introduction section - Define column names and add new labeling columns for ACT and SCENE - Fill down character, act, and scene to complete the labels - Inspect the head of the tibble
# tidyverse (readr, dplyr, stringr, tidyr) is assumed loaded in the setup chunk
setwd("~/Dropbox/challenge-hamlet/data")
hamlet <- read_delim(file = "hamlet.txt", delim = "\t", col_names = FALSE, skip = 77) %>%
rename("character" = "X1", "text" = "X2") %>%
mutate(
act = case_when(
str_detect(character, pattern = "ACT ") ~ character,
text == "HAMLET" ~ text,
TRUE ~ NA_character_),
scene = case_when(
str_detect(character, pattern = "ACT |SCENE ") ~ character,
text == "HAMLET" ~ text,
TRUE ~ NA_character_),
character = ifelse(text == "HAMLET", "play_title", str_remove(character, pattern = ":"))
) %>%
fill(act, .direction = "down") %>%
fill(character, .direction = "down") %>%
fill(scene, .direction = "down")
# Inspection
head(hamlet)
## # A tibble: 6 x 4
## character text act scene
## <chr> <chr> <chr> <chr>
## 1 play_title HAMLET HAMLET HAMLET
## 2 play_title <NA> ACT I ACT I
## 3 SCENE I Elsinore. A platform before the castle. ACT I SCENE I
## 4 SCENE I [FRANCISCO at his post. Enter to him BERNARDO] ACT I SCENE I
## 5 BERNARDO Who's there? ACT I SCENE I
## 6 FRANCISCO Nay, answer me: stand, and unfold yourself. ACT I SCENE I
Step 2: Further filter out unnecessary lines: stage directions marking entrances and exeunts, and lines containing only "|".
Sometimes a stage direction spans multiple lines, so its opening and closing square brackets fall on different lines. Occasionally, a line contains both dialogue and a direction. The target is to remove all stage directions (e.g. entrances and exeunts) without removing any dialogue.
remove_desc_hamlet <- hamlet %>%
mutate(
text = case_when(
str_detect(text, pattern = "\\[") &
str_detect(text, pattern = "\\]", negate = TRUE) ~ str_c(text, "]"),
str_detect(text, pattern = "\\]") &
str_detect(text, pattern = "\\[", negate = TRUE) ~ str_c("[", text),
TRUE ~ text
),
# create a new column with stage directions removed and whitespace trimmed
new_text = str_remove_all(text, pattern = "\\[.*\\]") %>% trimws(., which = "both")
) %>%
# remove empty or "not said" lines
filter(!(new_text %in% c("", "|")), !is.na(new_text), character != "play_title") %>%
select(-text) %>%
rename("text" = "new_text")
# Inspection
head(remove_desc_hamlet)
## # A tibble: 6 x 4
## character act scene text
## <chr> <chr> <chr> <chr>
## 1 SCENE I ACT I SCENE I Elsinore. A platform before the castle.
## 2 BERNARDO ACT I SCENE I Who's there?
## 3 FRANCISCO ACT I SCENE I Nay, answer me: stand, and unfold yourself.
## 4 BERNARDO ACT I SCENE I Long live the king!
## 5 FRANCISCO ACT I SCENE I Bernardo?
## 6 BERNARDO ACT I SCENE I He.
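The bracket-handling logic can be sanity-checked on a few toy lines in base R (these strings are illustrative, not drawn from the cleaned data):

```r
# Toy lines: a self-contained direction, a direction split across two lines,
# and a line mixing a direction with dialogue.
lines <- c(
  "[Exeunt all but HAMLET]",
  "[Enter KING CLAUDIUS, QUEEN GERTRUDE,",  # opening bracket only
  "and Attendants]",                        # closing bracket only
  "[Aside] A little more than kin, and less than kind."
)

# Complete unbalanced brackets, mirroring the case_when() above
balanced <- ifelse(grepl("\\[", lines) & !grepl("\\]", lines),
                   paste0(lines, "]"), lines)
balanced <- ifelse(grepl("\\]", balanced) & !grepl("\\[", balanced),
                   paste0("[", balanced), balanced)

# Strip the bracketed directions and trim the whitespace
cleaned <- trimws(gsub("\\[.*\\]", "", balanced))
cleaned  # only the dialogue in the last line survives
```

Filtering out the empty strings afterwards reproduces the filter() step above.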
remove_desc_hamlet %>%
filter(str_detect(character, pattern = "SCENE ", negate = TRUE)) %>%
group_by(act, scene) %>%
summarise(num_of_character = n_distinct(character)) %>%
ungroup() %>%
select(-act) %>%
kbl() %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F
) %>%
pack_rows("ACT I", 1, 5) %>%
pack_rows("ACT II", 6, 7) %>%
pack_rows("ACT III", 8, 11) %>%
pack_rows("ACT IV", 12, 18) %>%
pack_rows("ACT V", 19, 19) %>%
scroll_box(width = "100%", height = "400px")
| scene | num_of_character |
|---|---|
| ACT I | |
| SCENE I | 4 |
| SCENE II | 9 |
| SCENE III | 3 |
| SCENE IV | 3 |
| SCENE V | 4 |
| ACT II | |
| SCENE I | 3 |
| SCENE II | 8 |
| ACT III | |
| SCENE I | 7 |
| SCENE II | 14 |
| SCENE III | 5 |
| SCENE IV | 4 |
| ACT IV | |
| SCENE I | 2 |
| SCENE II | 3 |
| SCENE III | 3 |
| SCENE IV | 4 |
| SCENE V | 7 |
| SCENE VI | 3 |
| SCENE VII | 4 |
| ACT V | |
| SCENE I | 4 |
Step 1: Identify the lines attributed to "All"; inspection shows that only ACT I SCENE II and ACT III SCENE II contain such lines.
# inspect "All" involve which scene
remove_desc_hamlet %>%
filter(character == "All")
## # A tibble: 2 x 4
## character act scene text
## <chr> <chr> <chr> <chr>
## 1 All ACT I SCENE II Our duty to your honour.
## 2 All ACT III SCENE II Lights, lights, lights!
Step 2: Find the characters involved in these lines so that the number of lines per character can be corrected. I identified who joined each "All" line by visual inspection.
hamlet %>%
filter(act %in% c("ACT I", "ACT III"), scene == "SCENE II") %>%
mutate(label = ifelse(character == "All", "Keep", NA_character_)) %>%
group_by(act, scene) %>%
fill(label, .direction = "up") %>%
filter((label == "Keep" &
str_detect(text, pattern = "\\[|\\]")) |
(character == "All")) %>%
select(-label)
## # A tibble: 32 x 4
## # Groups: act, scene [2]
## character text act scene
## <chr> <chr> <chr> <chr>
## 1 SCENE II "[Enter KING CLAUDIUS, QUEEN GERTRUDE, HAMLET," ACT I SCENE …
## 2 SCENE II "and Attendants]" ACT I SCENE …
## 3 KING CLAUDIUS "[Exeunt VOLTIMAND and CORNELIUS]" ACT I SCENE …
## 4 HAMLET "[Aside] A little more than kin, and less than… ACT I SCENE …
## 5 KING CLAUDIUS "[Exeunt all but HAMLET]" ACT I SCENE …
## 6 HAMLET "[Enter HORATIO, MARCELLUS, and BERNARDO]" ACT I SCENE …
## 7 All " Our duty to your honour." ACT I SCENE …
## 8 SCENE II "[Enter HAMLET and Players]" ACT I… SCENE …
## 9 HAMLET "[Exeunt Players]" ACT I… SCENE …
## 10 HAMLET "[Enter POLONIUS, ROSENCRANTZ, and GUILDENSTERN… ACT I… SCENE …
## # … with 22 more rows
Note: I chose not to write a function that runs through the lines and records who remained on stage by the time each "All" line arrived, given the visual-inspection outcome (only two scenes) and the one-time use of this script. If this task were repeatable or generalizable, I would recommend writing such a function to loop through the lines.
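For completeness, here is a minimal sketch of what such a stage-tracking routine might look like in base R. The function name and regexes are hypothetical; it treats any run of two or more capital letters as a name, so multi-word names are split, and it ignores edge cases such as "Exeunt all but HAMLET":

```r
# Hypothetical helper: given a scene's raw lines (stage directions still in
# square brackets), return who is on stage when line `idx` is reached.
on_stage_at <- function(lines, idx) {
  present <- character(0)
  for (ln in lines[seq_len(idx)]) {
    # crude name extraction: runs of two or more capital letters
    names <- regmatches(ln, gregexpr("[A-Z]{2,}", ln))[[1]]
    if (grepl("\\[Enter ", ln)) {
      present <- union(present, names)
    } else if (grepl("\\[Exit ", ln)) {
      present <- setdiff(present, names)
    } else if (grepl("\\[Exeunt\\]", ln)) {
      present <- character(0)
    }
  }
  present
}

scene <- c("[Enter FRANCISCO and BERNARDO]",
           "Who's there?",
           "[Exit FRANCISCO]",
           "Nay, answer me: stand, and unfold yourself.")
on_stage_at(scene, 4)  # BERNARDO is still on stage
```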
Inspecting the entrances and exeunts for both scenes confirms who was present. Hamlet speaks by far the most lines, befitting the main character. The following characters are sorted in descending order of the number of lines each speaks throughout the play.
line_df <- remove_desc_hamlet %>%
group_by(character) %>%
summarise(number_of_lines = n()) %>%
ungroup() %>%
# remove scene label lines
filter(!str_detect(character, pattern = "SCENE |play_title|All")) %>%
# adjust for the "All" said
mutate(
number_of_lines = case_when(
character %in% c("HORATIO") ~ number_of_lines + 2L,
character %in% c("HAMLET", "MARCELLUS", "BERNARDO", "LUCIANUS", "OPHELIA") ~ number_of_lines + 1L,
TRUE ~ number_of_lines
)
) %>%
arrange(desc(number_of_lines))
line_df %>%
ggplot() +
geom_segment(
aes(
x = reorder(character, number_of_lines),
xend = reorder(character, number_of_lines),
y = 0,
yend = number_of_lines
),
color = "#999999",
size = 1
) +
geom_point(aes(x = reorder(character, number_of_lines), y = number_of_lines),
size = 4,
color = "#E69F00") +
geom_text_repel(
aes(
x = reorder(character, number_of_lines),
y = number_of_lines,
label = number_of_lines,
),
color = "grey40",
size = 4,
nudge_y = 30,
nudge_x = 0,
direction = "y",
hjust = 0,
min.segment.length = 0,
segment.color = 'transparent'
) +
scale_y_continuous(labels=function(x) format(x, big.mark = ",", decimal.mark = ".", scientific = FALSE), limits = c(NA,1300)) +
coord_flip() +
theme_ipsum() +
labs(title = "Number of Lines Said by Each Character",
x = "Character",
y = "Number of Lines")
Further analysis is carried out by grouping on ACT and SCENE. As the line chart shows, Hamlet does not have lines in every scene; he has overwhelmingly more lines than anyone else before ACT IV SCENE II (with exceptions such as ACT III SCENE III and a few others). After that point, OPHELIA and KING CLAUDIUS have more lines than Hamlet.
a1s2 <-
data.frame(
character = c("HORATIO", "MARCELLUS", "BERNARDO"),
add_line_1 = rep.int(1, 3),
act = rep("ACT I", 3),
scene = rep("SCENE II", 3)
)
a3s2 <-
data.frame(
character = c("LORD POLONIUS"),
add_line_2 = rep.int(1, 1),
act = rep("ACT III", 1),
scene = rep("SCENE II", 1)
)
line_by_scene_df <- remove_desc_hamlet %>%
filter(!str_detect(character, pattern = "SCENE |play_title|All")) %>%
group_by(act, scene, character) %>%
summarise(number_of_lines = n()) %>%
ungroup() %>%
left_join(a1s2) %>%
left_join(a3s2) %>%
mutate_at(vars(starts_with("add_line_")), ~ replace_na(., 0)) %>%
mutate(number_of_lines = number_of_lines + add_line_1 + add_line_2) %>%
arrange(desc(number_of_lines)) %>%
group_by(act, scene) %>%
# group_indices_() is deprecated; cur_group_id() gives the same per-group index
mutate(group_ind = cur_group_id()) %>%
ungroup() %>%
unite("act_scene",
c("act", "scene"),
sep = "_",
remove = FALSE)
df_end <- line_by_scene_df %>%
group_by(character) %>%
filter(group_ind == max(group_ind)) %>%
ungroup()
bpal2 <-
rep (
c(
"tomato3",
"brown",
"black",
"navy",
"steelblue",
"grey20",
"purple",
"orchid3",
"maroon",
"green4",
"cyan4"
),
3
)
names(bpal2) <- unique(line_by_scene_df$character)
color_pallate <- bpal2[1:length(unique(line_by_scene_df$character))]
plt_line_by_scene <- line_by_scene_df %>%
mutate(label = str_c("ACT: ", act, "\n",
"SCENE: ", scene, "\n",
"Character: ", character, "\n",
"No. of Lines: ", number_of_lines)) %>%
ggplot(aes(
x = act_scene,
y = number_of_lines,
group = character,
color = character,
text = label
)) +
geom_point(size = 3) +
geom_line(na.rm = TRUE) +
# geom_text_repel(
# data = df_end,
# aes(label = character),
# color = "grey40",
# size = 4,
# nudge_y = 0,
# nudge_x = 0.5,
# direction = "y",
# hjust = 0,
# min.segment.length = 0,
# segment.color = 'grey60'
# ) +
scale_x_discrete(expand = c(0.01, 0.10)) +
scale_y_continuous(trans = "log2") +
scale_color_manual(values = color_pallate) +
theme_ipsum() +
theme(legend.position = "none",
axis.text.x = element_text(angle = 45, size = 10)) +
labs(title = "Number of Lines by Act, Scene, and Character",
subtitle = "Where a line continues without a dot, the character has no lines in that scene",
y = "Number of Lines - Log2",
x = "")
ggplotly(plt_line_by_scene, tooltip = "text")
Step 1: Transform this data frame into a text-document format and remove descriptive rows.
prepare_scene_text_df <-
remove_desc_hamlet %>%
filter(str_detect(character, pattern = "play_title|SCENE ", negate = TRUE)) %>%
# combine ACT and SCENE to be one unique identifier of each document/scene
unite("act_scene", act:scene, sep = "_", remove = TRUE) %>%
select("doc_id" = act_scene, text)
head(prepare_scene_text_df)
## # A tibble: 6 x 2
## doc_id text
## <chr> <chr>
## 1 ACT I_SCENE I Who's there?
## 2 ACT I_SCENE I Nay, answer me: stand, and unfold yourself.
## 3 ACT I_SCENE I Long live the king!
## 4 ACT I_SCENE I Bernardo?
## 5 ACT I_SCENE I He.
## 6 ACT I_SCENE I You come most carefully upon your hour.
Step 2: Apply the tidytext package (i.e. unnest_tokens and stop_words) to tokenize the text and remove words that occur frequently but contribute little to the meaning of the text. Then, count the occurrence of each word in each scene.
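Conceptually, unnest_tokens lowercases each line and splits it on non-word characters, and anti_join(stop_words) then drops the common words. A rough base-R equivalent (the stop-word list here is a tiny illustrative subset, not the full stop_words table):

```r
text <- "Nay, answer me: stand, and unfold yourself."

# lowercase, then split on anything that is not a letter or apostrophe
tokens <- unlist(strsplit(tolower(text), "[^a-z']+"))
tokens <- tokens[tokens != ""]

# drop a few illustrative stop words (the real stop_words table is much larger)
stop_subset <- c("and", "me", "nay")
tokens[!tokens %in% stop_subset]
```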
tokenized_df <- prepare_scene_text_df %>%
unnest_tokens(word, text) %>%
anti_join(stop_words) %>%
group_by(doc_id, word) %>%
tally() %>%
arrange(desc(n))
head(tokenized_df)
## # A tibble: 6 x 3
## # Groups: doc_id [5]
## doc_id word n
## <chr> <chr> <int>
## 1 ACT II_SCENE II lord 44
## 2 ACT III_SCENE II lord 34
## 3 ACT I_SCENE V lord 18
## 4 ACT II_SCENE I lord 18
## 5 ACT III_SCENE II love 18
## 6 ACT I_SCENE II lord 17
Step 3: Apply bind_tf_idf to calculate term frequency (tf), inverse document frequency (idf), and term frequency-inverse document frequency (tf_idf) scores. The higher a word's tf_idf score, the more distinctive that word is to a particular scene relative to the rest of the play.
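As computed by bind_tf_idf, tf is a word's count divided by the total token count of its document, and idf is the natural log of the number of documents divided by the number of documents containing the word. The "sponge" row below can be sanity-checked by hand (19 scenes in the play; the total of roughly 65 retained tokens for ACT IV SCENE II is inferred from the table, not taken from the source):

```r
n_docs <- 19           # scenes (documents) in the play
docs_with_word <- 1    # "sponge" appears in a single scene
idf <- log(n_docs / docs_with_word)

n_total <- 65          # assumed token total for ACT IV SCENE II (3 / 0.0462)
tf <- 3 / n_total

round(idf, 2)          # ~2.94, matching the idf column
round(tf * idf, 3)     # ~0.136, matching the tf_idf column
```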
# use the grady_augmented word list from the lexicon package to filter out
# spelling mistakes and optical character recognition (OCR) errors
data(grady_augmented)
ordered_tfidf <- tokenized_df %>%
filter(word %in% grady_augmented) %>%
bind_tf_idf(word, doc_id, n) %>%
arrange(desc(tf_idf)) %>%
ungroup() %>%
top_n(., n=100, wt = tf_idf)
head(ordered_tfidf)
## # A tibble: 6 x 6
## doc_id word n tf idf tf_idf
## <chr> <chr> <int> <dbl> <dbl> <dbl>
## 1 ACT IV_SCENE II sponge 3 0.0462 2.94 0.136
## 2 ACT IV_SCENE VI letters 4 0.0513 1.56 0.0799
## 3 ACT V_SCENE I gallows 4 0.0217 2.94 0.0640
## 4 ACT IV_SCENE VI bless 2 0.0256 2.25 0.0577
## 5 ACT IV_SCENE VI fellows 2 0.0256 2.25 0.0577
## 6 ACT IV_SCENE III fat 4 0.0222 2.25 0.0500
Using a three-color scheme, it is clear that "sponge" is the rarest word in Hamlet, followed by "letters", "gallows", and the other words in lavender, and then the words in tomato. The current word cloud shows the top 100 unusual words and can be expanded on request.
wordcloud(ordered_tfidf$word, ordered_tfidf$tf_idf, #min.freq = 1,
max.words = 100,
colors = c("tomato3", "lavenderblush3","steelblue"))
Note: The tm and qdap packages are also useful for converting this data frame into a term-document matrix (TDM). A TDM (terms as rows, documents as columns) is chosen over a document-term matrix because there are more terms than scenes, and it is easier to organize more rows than columns.
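That said, a term-document matrix can also be cross-tabulated directly from the token counts with base R's xtabs, no extra packages required. A toy illustration with invented counts:

```r
counts <- data.frame(
  word   = c("lord", "lord", "love", "sponge"),
  doc_id = c("ACT I_SCENE I", "ACT II_SCENE II", "ACT II_SCENE II", "ACT IV_SCENE II"),
  n      = c(5, 44, 18, 3)
)

# terms as rows, scenes as columns -- the TDM orientation described above
tdm <- xtabs(n ~ word + doc_id, data = counts)
dim(tdm)  # 3 terms x 3 scenes
```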
I choose to apply two frameworks: Afinn and NRC, and compare their results.
afinnsentiments = get_sentiments('afinn')
head(afinnsentiments, 10)
## # A tibble: 10 x 2
## word value
## <chr> <dbl>
## 1 abandon -2
## 2 abandoned -2
## 3 abandons -2
## 4 abducted -2
## 5 abduction -2
## 6 abductions -2
## 7 abhor -3
## 8 abhorred -3
## 9 abhorrent -3
## 10 abhors -3
The results confirm that Hamlet is a tragedy: sentiment scores are overwhelmingly negative throughout the play, reaching their lowest (-1.18) in ACT II SCENE I, while the highest (0.57) occurs in ACT IV SCENE III.
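The per-scene score is simply the mean AFINN value of the distinct matched words, so a strongly negative mean indicates a scene dominated by negative vocabulary. A tiny base-R illustration with invented values:

```r
# Hypothetical AFINN values for the distinct matched words of one scene
values <- c(-2, -3, 1, -2, 2, -3, -1)
mean(values)  # about -1.14: negative on balance, as in most scenes of the play
```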
afinn_df <-
tokenized_df %>%
inner_join(afinnsentiments) %>%
group_by(doc_id) %>%
# mean AFINN value over the distinct matched words in each scene
summarise(n = mean(value)) %>%
arrange(doc_id)
afinn_df %>%
ggplot(aes(x = fct_reorder(doc_id,n), y = n)) +
geom_col()+
theme_ipsum() +
coord_flip() +
labs(title = "AFINN Sentiment Score by Scene",
subtitle = "Descending ordered by sentiment scores",
x = "",
y = "Score") +
scale_y_continuous(expand = c(-0.5, 1))
nrcsentiments = get_sentiments('nrc')
head(nrcsentiments)
## # A tibble: 6 x 2
## word sentiment
## <chr> <chr>
## 1 abacus trust
## 2 abandon fear
## 3 abandon negative
## 4 abandon sadness
## 5 abandoned anger
## 6 abandoned fear
nrc_df <- tokenized_df %>%
inner_join(nrcsentiments) %>%
group_by(doc_id, sentiment) %>%
summarise(n = sum(n)) %>%
ungroup() %>%
group_by(doc_id) %>%
mutate(value = n / sum(n)) %>%
ungroup() %>%
mutate(value = scales::percent(value, accuracy = 0.01)) %>%
select(-n) %>%
pivot_wider(names_from = "sentiment", values_from = "value") %>%
select(doc_id, positive, negative, everything(.)) %>%
separate(col = doc_id,
into = c("act", "scene"),
sep = "_") %>%
select(-act)
kbl(nrc_df) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = F
) %>%
add_header_above(c(
" " = 1,
"Sentiment" = 2,
"Emotion" = 8
)) %>%
pack_rows("ACT I", 1, 5) %>%
pack_rows("ACT II", 6, 7) %>%
pack_rows("ACT III", 8, 11) %>%
pack_rows("ACT IV", 12, 18) %>%
pack_rows("ACT V", 19, 19) %>%
scroll_box(width = "100%", height = "600px")
| scene | positive | negative | anger | anticipation | disgust | fear | joy | sadness | surprise | trust |
|---|---|---|---|---|---|---|---|---|---|---|
| ACT I | ||||||||||
| SCENE I | 20.17% | 16.48% | 8.81% | 11.36% | 3.41% | 11.93% | 6.82% | 7.39% | 4.83% | 8.81% |
| SCENE II | 20.63% | 17.49% | 3.63% | 7.76% | 5.94% | 7.76% | 7.26% | 8.42% | 4.62% | 16.50% |
| SCENE III | 21.61% | 14.99% | 5.76% | 8.65% | 5.48% | 8.65% | 10.09% | 4.03% | 4.90% | 15.85% |
| SCENE IV | 22.49% | 20.71% | 5.92% | 5.92% | 9.47% | 11.24% | 5.92% | 4.73% | 1.78% | 11.83% |
| SCENE V | 18.42% | 18.05% | 8.08% | 6.58% | 9.40% | 11.09% | 6.58% | 5.64% | 4.89% | 11.28% |
| ACT II | ||||||||||
| SCENE I | 21.67% | 16.33% | 5.67% | 5.33% | 8.67% | 8.00% | 6.67% | 5.33% | 5.67% | 16.67% |
| SCENE II | 21.03% | 15.30% | 6.61% | 7.75% | 7.81% | 8.19% | 8.88% | 7.24% | 3.72% | 13.48% |
| ACT III | ||||||||||
| SCENE I | 20.24% | 14.88% | 6.75% | 6.75% | 7.74% | 9.52% | 7.94% | 8.73% | 4.37% | 13.10% |
| SCENE II | 22.64% | 16.18% | 4.18% | 7.27% | 7.36% | 6.64% | 9.55% | 7.55% | 4.55% | 14.09% |
| SCENE III | 16.19% | 18.62% | 8.50% | 8.10% | 6.48% | 9.31% | 7.69% | 8.50% | 4.05% | 12.55% |
| SCENE IV | 19.09% | 18.75% | 6.42% | 7.94% | 6.25% | 9.63% | 7.26% | 8.95% | 4.05% | 11.66% |
| ACT IV | ||||||||||
| SCENE I | 11.43% | 20.00% | 12.38% | 6.67% | 9.52% | 15.24% | 3.81% | 5.71% | 4.76% | 10.48% |
| SCENE II | 33.33% | 25.49% | 1.96% | 3.92% | 11.76% | 3.92% | 1.96% | NA | NA | 17.65% |
| SCENE III | 17.61% | 18.31% | 4.23% | 11.97% | 8.45% | 4.93% | 6.34% | 9.86% | 6.34% | 11.97% |
| SCENE IV | 22.54% | 13.29% | 6.36% | 9.83% | 5.20% | 10.98% | 6.36% | 5.78% | 4.05% | 15.61% |
| SCENE V | 17.67% | 13.38% | 5.83% | 9.61% | 5.49% | 9.43% | 9.78% | 8.06% | 6.52% | 14.24% |
| SCENE VI | 23.21% | 17.86% | 8.93% | 10.71% | 5.36% | 7.14% | 5.36% | 3.57% | 1.79% | 16.07% |
| SCENE VII | 17.28% | 16.82% | 6.91% | 9.68% | 6.91% | 8.53% | 6.68% | 8.53% | 5.99% | 12.67% |
| ACT V | ||||||||||
| SCENE I | 14.52% | 17.74% | 8.06% | 6.99% | 2.69% | 14.52% | 6.99% | 15.05% | 4.30% | 9.14% |
Note: Scroll within the box above to view the full table.
radar_df <- tokenized_df %>%
inner_join(nrcsentiments) %>%
group_by(doc_id, sentiment) %>%
summarise(n = sum(n)) %>%
ungroup() %>%
group_by(doc_id) %>%
mutate(value = n / sum(n)) %>%
ungroup() %>%
select(-n) %>%
pivot_wider(names_from = "sentiment", values_from = "value") %>%
select(-c("positive", "negative")) %>%
separate(col = doc_id,
into = c("act", "scene"),
sep = "_") %>%
summarise_at(vars(anger:trust), ~ median(., na.rm = TRUE)) %>%
# fmsb::radarchart() expects the first two rows to hold the axis max and min;
# there are 8 emotion columns
rbind(rep(0.15, 8), rep(0, 8), .)
radarchart(
radar_df,
axistype = 1 ,
pcol = rgb(0.2, 0.5, 0.5, 0.9) ,
pfcol = rgb(0.2, 0.5, 0.5, 0.5) ,
plwd = 4 ,
cglcol = "grey",
cglty = 2,
axislabcol = "grey",
caxislabels = seq(0, 20, 5),
cglwd = 0.8,
vlcex = 0.8
)
To conclude, this report addressed three key questions:
Highlight: I noticed the unusual lines attributed to "All" and adjusted the counts for the characters involved. Results are presented in a lollipop chart, and then in a line chart showing the number of lines each character speaks over the timeline of the play.
Highlight: I removed common stop words, split the play by scene (this could also be done by character), and applied tf-idf to measure how distinctive each word is within the play. The outcome is presented in a word cloud, as requested.
Highlight: I compared the results of applying the AFINN and NRC frameworks to the text tokens. Advantages of AFINN include, but are not limited to, faster computation on larger datasets and simple, interpretable results. NRC, on the other hand, offers eight additional dimensions of human emotion, so its results are more comprehensive and humanized. The results are compared across a column chart, a table, and a radar chart.
Additional character-level analysis could follow the approach proposed in Nalisnick and Baird (2013); see reference 2. More sentiment frameworks could also be tested, for instance "bing" (positive/negative).
If the analysis is to be carried out by less technical users, an interactive visualization should be developed; Shiny is a good complementary tool for this report.
All suggestions are welcome.
Holtz, Y. Basic Radar Chart. The R Graph Gallery. URL: https://www.r-graph-gallery.com/142-basic-radar-chart.html
Nalisnick, E. T., & Baird, H. S. (2013, August). Character-to-character sentiment analysis in Shakespeare’s plays. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 479-483). URL: https://www.aclweb.org/anthology/P13-2085.pdf
Ryan, Y. (2020). A Short Guide to Historical Newspaper Data, Using R. URL: https://bookdown.org/yann_ryan/r-for-newspaper-data/calculating-tf-idf-scores-with-tidytext.html
Zhu, H. (2020). Create Awesome HTML Table with knitr::kable and kableExtra. URL: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html